In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style(style="darkgrid")
# plt.rcParams['figure.figsize'] = [12.0, 8.0]   # make plots size, double of the notebook normal
plt.rcParams['figure.figsize'] = [9.0, 6.0]   # make plots size, double of the notebook normal

from IPython.display import display, HTML      # to use display() to always have well formatted html table output

Dataset Description

Variable Name Definition
PassengerId A unique ID to each Passenger; 1-891
Survived A boolean variable; 1 - Survived, 0 - Dead
Pclass Ticket Class; 1 - 1st, 2 - 2nd, 3 - 3rd class
Name Passenger Name
Sex Sex of Passenger
Age Age in Years
SibSp Number of Siblings / Spouses Aboard
Parch Number of parents / children aboard the titanic
Ticket Ticket number
Fare Passenger Fare
Cabin Cabin number
Embarked Port of Embarkation; C - Cherbourg, Q - Queenstown, S - Southampton


Some Notes Regarding Dataset

Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

SibSp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

*Source: [Kaggle's - Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data)*



Loading data & Preview


In [3]:
titanic_df = pd.read_csv("titanic-data.csv", index_col=["PassengerId"])
titanic_df.head()


Out[3]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

In [4]:
titanic_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

Dataset summary above shows that there are 891 entries.

However, from above we can also see that we have missing values in - Age, Cabin, Embarked columns.

  • Missing values of Cabin and Embarked will not be fixed - because no questions are based on these factors
  • Missing values of Age will be fixed now - because involved in various questions and analysis below.

Fix missing ages

To review the data by distributions, and to tackle various questions - we first need to deal with this issue of missing ages.

If we assume that the missing ages will be distributed similarly, to the values that are present - then we can substitue values that represent the existing distribution.

For this we can replace the missing values with the mean.

To have best representative values populated - we will taken mean based on Sex and Pclass. In other words the mean of ages for Sex within the Pclass, and when replacing the missing age, these two factors will be kept in consideration - to use the related mean of ages.


In [5]:
mean_ages = titanic_df.groupby(['Sex','Pclass'])['Age'].mean()
display(mean_ages)


Sex     Pclass
female  1         34.611765
        2         28.722973
        3         21.750000
male    1         41.281386
        2         30.740707
        3         26.507589
Name: Age, dtype: float64

In [6]:
def replace_nan_age(row):
    if pd.isnull(row['Age']):
        return mean_ages[row['Sex'], row['Pclass']]
    else:
        return row['Age']
    
titanic_df['Age'] = titanic_df.apply(replace_nan_age, axis=1)
titanic_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

From the above we can see that the missing ages have been filled.


In [7]:
titanic_df.describe()


Out[7]:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.318643 0.523008 0.381594 32.204208
std 0.486592 0.836071 13.281103 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 21.750000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 26.507589 0.000000 0.000000 14.454200
75% 1.000000 3.000000 36.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [8]:
titanic_df.Parch.hist()
plt.xlabel('Parch')
plt.ylabel('Passengers')
plt.title('Number of parents / children aboard')


Out[8]:
<matplotlib.text.Text at 0x962ffd0>

In [9]:
titanic_df.SibSp.hist()
plt.xlabel('SibSp')
plt.ylabel('Passengers')
plt.title('Number of Siblings / Spouses aboard')


Out[9]:
<matplotlib.text.Text at 0xb58d710>

From the above we notice

  • Oldest passenger was 80 years old
  • Youngest passenger was about 5 months old
  • Average age of passengers was 29.32 - but note this also has missing ages
  • Mean survival is 0.3838
  • Max fare charged was $512.33
  • Maximum number of Siblings / Spouses were 8
  • Maximum number of Parent / Child were 6

Questions in mind

  1. Did passenger class made any difference to his survival?
  2. Which gender had more survival?
  3. Person travelling with others had more survival possibility?
  4. Which age group had better chance of survival?
  5. What was male and female survival per class and by age?

Question 1 - Did passenger class made any difference to his survival?


In [10]:
## SUBSET DATAFRAME TO JUST THE REQUIRED DATA

survived_plass_df = titanic_df[['Survived', 'Pclass']]          # works - just have to say the columns required
survived_plass_df.head()

## GROUP DATA TO CALCULATE SURVIVED & TOTAL BY PCLASS

## calculate survived by pclass
survived_by_pclass = survived_plass_df.groupby(['Pclass']).sum()
total_by_pclass = survived_plass_df.groupby(['Pclass']).count()

# total are showed as survived - so change to column name Total
total_by_pclass.rename(columns = {'Survived':'Total'}, inplace = True)

# merge separate data into one dataframe
survived_total_by_pclass = pd.merge(survived_by_pclass, total_by_pclass, left_index=True, right_index=True) # merge by index
survived_total_by_pclass


Out[10]:
Survived Total
Pclass
1 136 216
2 87 184
3 119 491

In [11]:
percent_survived = (survived_total_by_pclass['Survived'] / survived_total_by_pclass['Total']) * 100
survived_total_by_pclass['Percentage'] = percent_survived

survived_total_by_pclass


Out[11]:
Survived Total Percentage
Pclass
1 136 216 62.962963
2 87 184 47.282609
3 119 491 24.236253

In [12]:
x = survived_total_by_pclass.index.values
ht = survived_total_by_pclass.Total
hs = survived_total_by_pclass.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, x)
plt.xlabel('Pclass')
plt.ylabel('Passengers')
plt.title('Survivors by Class')


plt.legend([pht,phs],['Died', 'Survived'])


Out[12]:
<matplotlib.legend.Legend at 0xb74e748>

Conclusion

As can be seen from the visualization and also from the dataframe table above - 1st Class passengers had highest rate of survival, then 2nd class passengers, and the least survival rates was of 3rd class passengers. A large number of passengers were travelling in 3rd class (491), but only 24.24% survived.

Question 2 - Which gender had more survival?


In [13]:
## CALCULATE SURVIVED AND TOTAL BY SEX

# groupby Sex
group_by_sex = titanic_df.groupby('Sex')

# calculate survived by sex
survived_by_sex = group_by_sex['Survived'].sum()
survived_by_sex.name = 'Survived'
display(survived_by_sex)

# calculate total by sex
total_by_sex = group_by_sex['Survived'].size()
total_by_sex.name = 'Total'
display(total_by_sex)

# concat the separate results into one dataframe
survived_total_by_sex = pd.concat([survived_by_sex, total_by_sex], axis=1)
survived_total_by_sex


Sex
female    233
male      109
Name: Survived, dtype: int64
Sex
female    314
male      577
Name: Total, dtype: int64
Out[13]:
Survived Total
Sex
female 233 314
male 109 577

In [14]:
percent_survived = (survived_total_by_sex['Survived'] / survived_total_by_sex['Total']) * 100
survived_total_by_sex['Percentage'] = percent_survived

survived_total_by_sex


Out[14]:
Survived Total Percentage
Sex
female 233 314 74.203822
male 109 577 18.890815

Now lets visualize it


In [15]:
x = range(len(survived_total_by_sex.index.values))
ht = survived_total_by_sex.Total
hs = survived_total_by_sex.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, survived_total_by_sex.index.values)
plt.xlabel('Sex')
plt.ylabel('Passengers')
plt.title('Survivors by Gender')

plt.legend([pht,phs],['Died', 'Survived'])


Out[15]:
<matplotlib.legend.Legend at 0xb94b5c0>

Conclusion

From the visualization and percentage of survival from the dataframe printout above - we can see that females had very high rate of survival. Female survial rate was 74.3%, and male survival rate was 18.9% - so female survival rate was about 4 times that of males.

It can be concluded that females were given preference in rescue operations, and males must have sacrificed themselves to let the females survive.

Question 3 - Person travelling with others had more survival possibility?

Lets first reivew the distribution of those who were alone, and those who were in company.


In [16]:
is_not_alone = (titanic_df.SibSp + titanic_df.Parch) >= 1
passengers_not_alone = titanic_df[is_not_alone]

is_alone = (titanic_df.SibSp + titanic_df.Parch) == 0
passengers_alone = titanic_df[is_alone]

print('Not alone - describe')
display(passengers_not_alone.describe())
print('Alone - describe')
display(passengers_alone.describe())


Not alone - describe
Survived Pclass Age SibSp Parch Fare
count 354.000000 354.000000 354.000000 354.000000 354.000000 354.000000
mean 0.505650 2.169492 26.316614 1.316384 0.960452 48.832275
std 0.500676 0.864520 14.901225 1.420774 1.039512 55.307615
min 0.000000 1.000000 0.420000 0.000000 0.000000 6.495800
25% 0.000000 1.000000 17.000000 1.000000 0.000000 18.000000
50% 1.000000 2.000000 26.000000 1.000000 1.000000 27.750000
75% 1.000000 3.000000 36.000000 1.000000 2.000000 59.044800
max 1.000000 3.000000 70.000000 8.000000 6.000000 512.329200
Alone - describe
Survived Pclass Age SibSp Parch Fare
count 537.000000 537.000000 537.000000 537.0 537.0 537.000000
mean 0.303538 2.400372 31.297634 0.0 0.0 21.242689
std 0.460214 0.804511 11.694910 0.0 0.0 42.223510
min 0.000000 1.000000 5.000000 0.0 0.0 0.000000
25% 0.000000 2.000000 23.000000 0.0 0.0 7.775000
50% 0.000000 3.000000 27.000000 0.0 0.0 8.137500
75% 1.000000 3.000000 36.000000 0.0 0.0 15.000000
max 1.000000 3.000000 80.000000 0.0 0.0 512.329200

In [17]:
passengers_not_alone.Age.hist(label='Not alone')
passengers_alone.Age.hist(label='Alone', alpha=0.6)

plt.xlabel('Age')
plt.ylabel('Passengers')
plt.legend(loc='best')
plt.title('Alone & Not Alone Passenger\'s Ages')


Out[17]:
<matplotlib.text.Text at 0xbcfb400>

From the above distribution we can see that

  • Those in age range of 0-10, that is kids, were not alone - which makes sense
  • There however is one kid age 5 who was alone
  • There was an 80 year old person also who was alone
  • 537 passengers were alone, whereas 354 were in company
  • Except for age group 0-10, for all other age groups, those travelling alone outnumbered those travelling in company

Now lets review these by their survival


In [18]:
notalone = np.where((titanic_df.SibSp + titanic_df.Parch) >= 1, 'Not Alone', 'Alone')
loneliness_summary = titanic_df.groupby(notalone, as_index=False)['Survived'].agg([np.sum, np.size])
loneliness_summary = loneliness_summary.rename(columns={'sum':'Survived', 'size':'Total'})

loneliness_summary


Out[18]:
Survived Total
Alone 163 537
Not Alone 179 354

In [19]:
loneliness_summary['Percent survived'] = (loneliness_summary.Survived / loneliness_summary.Total) * 100

loneliness_summary


Out[19]:
Survived Total Percent survived
Alone 163 537 30.353818
Not Alone 179 354 50.564972

Now lets visualize


In [20]:
x = range(len(loneliness_summary.index.values))
ht = loneliness_summary.Total
hs = loneliness_summary.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, loneliness_summary.index.values)
plt.xlabel('Alone / Not Alone')
plt.ylabel('Passengers')
plt.title('Survivors by Alone / Not Alone')


plt.legend([pht,phs],['Died', 'Survived'])


Out[20]:
<matplotlib.legend.Legend at 0xb9bd6a0>

Conclusion

Percentage above and visualizations above clearly indicate that people having company had higher survival rate.

Question 4 - Which age group had a better chance of survival?

First lets review gender age distribution


In [21]:
male_ages = (titanic_df[titanic_df.Sex == 'male'])['Age']
male_ages.describe()


Out[21]:
count    577.000000
mean      30.423672
std       13.264336
min        0.420000
25%       23.000000
50%       27.000000
75%       37.000000
max       80.000000
Name: Age, dtype: float64

In [22]:
female_ages = (titanic_df[titanic_df.Sex == 'female'])['Age']
female_ages.describe()


Out[22]:
count    314.000000
mean      27.288063
std       13.091327
min        0.750000
25%       21.000000
50%       24.000000
75%       35.000000
max       63.000000
Name: Age, dtype: float64

In [23]:
male_ages.hist(label='Male')
female_ages.hist(label='Female')

plt.xlabel('Age')
plt.ylabel('Passengers')
plt.title('Male & Female passenger ages')
plt.legend(loc='best')


Out[23]:
<matplotlib.legend.Legend at 0xb72a400>

From above distribution, we can see that:

  • For every age group the number of females was less than number of males
  • The age of oldest female was 63, whereas age of oldest male was 80

Now lets do survival analysis by the age group


In [24]:
def age_group(age):
    if age >= 80:
        return '80-89'
    if age >= 70:
        return '70-79'
    if age >= 60:
        return '60-69'
    if age >= 50:
        return '50-59'
    if age >= 40:
        return '40-49'
    if age >= 30:
        return '30-39'
    if age >= 20:
        return '20-29'
    if age >= 10:
        return '10-19'
    if age >= 0:
        return '0-9'
    
titanic_df['AgeGroup'] = titanic_df.Age.apply(age_group)
titanic_df.head()


Out[24]:
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeGroup
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 20-29
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 30-39
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 20-29
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 30-39
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 30-39

In [25]:
age_group_summary = titanic_df.groupby(['AgeGroup'], as_index=False)['Survived'].agg([np.sum, np.size])
age_group_summary = age_group_summary.rename(columns={'sum':'Survived', 'size':'Total'})
age_group_summary


Out[25]:
Survived Total
AgeGroup
0-9 38 62
10-19 41 102
20-29 113 358
30-39 84 185
40-49 39 110
50-59 20 48
60-69 6 19
70-79 0 6
80-89 1 1

In [26]:
x = range(len(age_group_summary.index.values))
ht = age_group_summary.Total
hs = age_group_summary.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, age_group_summary.index.values)
plt.xlabel('Age groups')
plt.ylabel('Passengers')
plt.title('Survivors by Age group')


plt.legend([pht,phs],['Died', 'Survived'])


Out[26]:
<matplotlib.legend.Legend at 0xc6574a8>

In [27]:
age_group_summary['SurvivedPercent'] = (age_group_summary.Survived / age_group_summary.Total) * 100
age_group_summary['DiedPercent'] = ((age_group_summary.Total - age_group_summary.Survived) / age_group_summary.Total) * 100
age_group_summary


Out[27]:
Survived Total SurvivedPercent DiedPercent
AgeGroup
0-9 38 62 61.290323 38.709677
10-19 41 102 40.196078 59.803922
20-29 113 358 31.564246 68.435754
30-39 84 185 45.405405 54.594595
40-49 39 110 35.454545 64.545455
50-59 20 48 41.666667 58.333333
60-69 6 19 31.578947 68.421053
70-79 0 6 0.000000 100.000000
80-89 1 1 100.000000 0.000000

From the above visualization and percentages we can see that most survivors were from 20-29 age group.

But interestingly survival percentage of 0-9 age group is best - at 61.29%.

Also above we have seen that female had better survial rate - so these survial rates must be mix of male and female survival rates - and hence to have better view, the gender aspect should also be taken into consideration.


In [28]:
sex_agegroup_summary = titanic_df.groupby(['Sex','AgeGroup'], as_index=False)['Survived'].mean()

sex_agegroup_summary


Out[28]:
Sex AgeGroup Survived
0 female 0-9 0.633333
1 female 10-19 0.755556
2 female 20-29 0.681034
3 female 30-39 0.855072
4 female 40-49 0.687500
5 female 50-59 0.888889
6 female 60-69 1.000000
7 male 0-9 0.593750
8 male 10-19 0.122807
9 male 20-29 0.140496
10 male 30-39 0.215517
11 male 40-49 0.217949
12 male 50-59 0.133333
13 male 60-69 0.133333
14 male 70-79 0.000000
15 male 80-89 1.000000

In [29]:
male_agegroup_summary = sex_agegroup_summary[sex_agegroup_summary['Sex'] == 'male']

male_agegroup_summary


Out[29]:
Sex AgeGroup Survived
7 male 0-9 0.593750
8 male 10-19 0.122807
9 male 20-29 0.140496
10 male 30-39 0.215517
11 male 40-49 0.217949
12 male 50-59 0.133333
13 male 60-69 0.133333
14 male 70-79 0.000000
15 male 80-89 1.000000

In [30]:
female_agegroup_summary = sex_agegroup_summary[sex_agegroup_summary['Sex'] == 'female']

female_agegroup_summary


Out[30]:
Sex AgeGroup Survived
0 female 0-9 0.633333
1 female 10-19 0.755556
2 female 20-29 0.681034
3 female 30-39 0.855072
4 female 40-49 0.687500
5 female 50-59 0.888889
6 female 60-69 1.000000

In [31]:
age_group = titanic_df.AgeGroup.unique()
age_labels = sorted(age_group)
print age_labels


['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89']

In [32]:
ax = sns.barplot(x='AgeGroup', y='Survived', data=titanic_df, hue='Sex', order=age_labels)
ax.set_title('Survivors by Gender by Age groups')


Out[32]:
<matplotlib.text.Text at 0xc984080>

Conclusion

From the proportions above, and the visualization, taking into consideration the gender and age group - it is clearly visible that female and children were given preference in rescue operations by the other male passengers. 0-9 age group both male and female children had very high rate of survival.

Question 5 - What was male and female survival per class and by age?

Male and Female per Pclass

Lets review the males and females, per the passenger classes


In [33]:
sns.swarmplot(x='Pclass', y='Age', data=titanic_df, hue='Sex', dodge=True).set_title('Male and Female Passenger Ages by Class')


Out[33]:
<matplotlib.text.Text at 0xcc60e80>

Above we can see that compared to first and second class, there were large number of passengers in third class. Particularly males were in large number ... by the look of swarm, they were in large number in age 18 to 32.

To understand the age distribution of male and female in different class - the better plot is box plot.


In [34]:
sns.boxplot(x='Pclass', y='Age', data=titanic_df, hue='Sex').set_title('Comparison of Male and Female Passenger Ages by Class')


Out[34]:
<matplotlib.text.Text at 0xd0e6fd0>

From above we can make out that mean age of male and female in 3rd class was less than that of males and females in 2nd and 1st class. Highest mean age of males was in 1st class.

But this plot gives only idea of the distribution of ages of males and females per class.

Now lets try to understand the survival of male and female, per class

Male and Female Survival per Pclass, and by Age


In [35]:
def scatter(passengers, marker='o', legend_prefix=''):
    survived = passengers[passengers.Survived == 1]
    died = passengers[passengers.Survived == 0]

    x = survived.Age
    y = survived.Fare
    plt.scatter(x, y, c='blue', alpha=0.5, marker=marker, label=legend_prefix + ' Survived')

    x = died.Age
    y = died.Fare
    plt.scatter(x, y, c='red', alpha=0.5, marker=marker, label=legend_prefix + ' Died')

def scatter_by_class(pclass):
    class_passengers = titanic_df[titanic_df.Pclass == pclass]
    
    male_passengers = class_passengers[class_passengers.Sex == 'male']
    female_passengers = class_passengers[class_passengers.Sex == 'female']
    
    scatter(male_passengers, marker='o', legend_prefix='Male')
    scatter(female_passengers, marker='^', legend_prefix='Female')

    plt.legend(bbox_to_anchor=(0,1), loc='best') # bbox - to move legend out of plot/scatter
    plt.xlabel('Age')
    plt.ylabel('Fare')
    plt.title('Gender survival by Age, for Pclass = ' + str(pclass))

In [36]:
scatter_by_class(1)



In [37]:
scatter_by_class(2)



In [38]:
scatter_by_class(3)


The above three scatter plots give a view of male and female age and survival in each of the class.

But for better clarity and understanding, we can separate the scatter plots for male and female, per class. This is what we will do next.


In [39]:
def sns_scatter_by_class(pclass):
    fg = sns.FacetGrid(titanic_df[titanic_df['Pclass'] == pclass], 
                      col='Sex',
                      col_order=['male', 'female'],
                      hue='Survived', 
                      hue_kws=dict(marker=['v', '^']), 
                      size=6,
                      palette='Set1')
    fg = (fg.map(plt.scatter, 'Age', 'Fare', edgecolor='w', alpha=0.7, s=80).add_legend())
    plt.subplots_adjust(top=0.9)
    fg.fig.suptitle('Gender survival by Age, for CLASS {}'.format(pclass))

# plotted separately because male and female data in same scatter plot difficult to understand comparitive
sns_scatter_by_class(1)
sns_scatter_by_class(2)
sns_scatter_by_class(3)


Conclusion

From the just above scatter plots we have lot of clarity about male female age spread, and survivals.

We can notice the following

  • Females in first and second class were mostly all saved/survived.
  • In first and second class male and female kids (age group 0-10) almost all survived
  • In third class also the survival rate of female was higher than of males - but the female survival was less compared the female survival compared to 1st and 2nd class females

We can confirm the above observation by following barplot - showing the rate of survival by class, by sex


In [40]:
sns.barplot(x='Pclass', y='Survived', data=titanic_df, hue='Sex').set_title('Gender Survival by Class')


Out[40]:
<matplotlib.text.Text at 0xf27cfd0>

Overall Conclusion

Findings

  • Female survial rate was 74.3%, and male survival rate was 18.9% - so female survival rate was about 4 times that of males. Hence, female and children were given preference in rescue operations, and must have been saved by other male passengers.

  • 62.96 percent of 1st class passengers survived, whereas 3rd class passengers survival rate was 24.24% which is about one-third of the first class passengers. This is surprising that the 1st class passengers have high survival, that is they were given preference because of their social class.

  • 50% of passengers travelling with family survived, whereas survival percentage was 30% for those travelling alone. Hence survival rate was high for passengers travelling with family, as compared to those travelling alone.

  • Children had higer rate of survival as compared to adults

Limitations

  • Limitations in the analysis
    • Above only visualization, proportions, and percentages have been used to come to conclusions. However improvements could be made in the analysis by using statistical tests.
  • Limitations of the data
    • The dataset itself has some limitations - it has data missing for certain features/properties of passengers, like age
    • Also this data is only sample data from the population/full-data of titanic.
    • Missing data and sample size could skew the results, for example becuase of missing ages
    • Also we do not know if this sample was properly randomly selected

Future plans

Future work or potential areas to explore

  • Only 3 parameters were used in analysis - Age, Sex, and Passenger class - whereas more analysis is possible

  • More questions that can be asked are

    • Does a person with higher fare had higher chances of survival?
    • Does the deck of the cabin increased/effected the rate of survivval?
    • Does the survival had any relation with the title of a person (like Mr. Mrs, Miss, etc)